Validating machine learning models#
Once we have built a machine learning model, we need to validate that it has learned something meaningful from the training data. This step is known as machine learning validation.
Validating a machine learning model is essential when developing any data-driven solution. It ensures that the model performs as intended and has learned relevant patterns from the data. Validation involves assessing a model’s accuracy, reliability, and generalization performance. It is crucial because models can easily overfit the training data, making them unreliable in real-world scenarios.
This process involves splitting the data into training and validation sets, evaluating the model’s performance on the validation set, and tuning the model’s parameters until an acceptable level of performance is achieved.
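As a minimal sketch of this loop (using a synthetic dataset from make_regression and a plain decision tree purely for illustration; the next section works through a real dataset), comparing the score on the training data with the score on a held-out validation set is a quick way to spot overfitting:
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor
# synthetic regression data, used here only to illustrate the workflow
X, y = make_regression(n_samples=1000, n_features=5, noise=10, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25, random_state=0)
tree = DecisionTreeRegressor().fit(X_train, y_train)
# a near-perfect training score combined with a much lower validation score
# means the model has memorized the training data rather than generalized
print(tree.score(X_train, y_train), tree.score(X_val, y_val))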
How To#
from sklearn.model_selection import train_test_split
import pandas as pd
df = pd.read_csv("data/housing.csv")
df.head()
| | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
df = df.dropna()  # drop rows with missing values before modelling
# hold out half of the data, stratified by ocean_proximity so each split covers a similar mix of locations
x_train, x_, y_train, y_ = train_test_split(df.drop(["longitude","latitude", "ocean_proximity", "median_house_value"], axis=1),
                                            df.median_house_value, test_size=.5, stratify=df.ocean_proximity)
# split the held-out half into equal validation and test sets (25% of the data each)
x_val, x_test, y_val, y_test = train_test_split(x_, y_, test_size=.5)
from sklearn.ensemble import RandomForestRegressor
model = RandomForestRegressor().fit(x_train, y_train)
model.score(x_val, y_val)
0.6505743884778422
Cross-validation#
from sklearn.model_selection import cross_val_score, cross_val_predict
cross_val_score(model, x_val, y_val)
array([0.64030317, 0.63953665, 0.67780258, 0.61851229, 0.60711769])
cross_val_predict(model, x_test, y_test)
array([199369. , 142416. , 185627.02, ..., 135886. , 155118. ,
419845.36])
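cross_val_score returns one score per fold (five folds by default), while cross_val_predict returns a prediction for every sample, made by a model that did not see that sample during fitting. A common way to summarize the fold scores, for example, is their mean and standard deviation:
scores = cross_val_score(model, x_val, y_val)
# average fold score plus an estimate of how much it varies between folds
print(scores.mean(), scores.std())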
Dummy Models#
from sklearn.dummy import DummyClassifier, DummyRegressor
from sklearn.ensemble import RandomForestClassifier
dummy = DummyRegressor(strategy="mean")  # baseline that always predicts the mean of the training targets
dummy.fit(x_train, y_train)
DummyRegressor()
dummy.score(x_val, y_val)
-0.00012451028437854283
cross_val_predict(dummy, x_test, y_test)
array([204526.08000979, 204526.08000979, 204526.08000979, ...,
203670.76834638, 203670.76834638, 203670.76834638])
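The dummy regressor scores roughly zero because the default score of scikit-learn regressors is R², which measures improvement over predicting the mean of the targets, and predicting the mean is exactly what this dummy does. One way to compare the baseline with the random forest more directly, for example, is the mean absolute error of their cross-validated predictions:
from sklearn.metrics import mean_absolute_error
# error of the mean-predicting baseline vs. error of the random forest
print(mean_absolute_error(y_test, cross_val_predict(dummy, x_test, y_test)))
print(mean_absolute_error(y_test, cross_val_predict(model, x_test, y_test)))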
# reuse the same features, but now predict ocean_proximity, turning the problem into classification
x_train, x_, y_train, y_ = train_test_split(df.drop(["longitude","latitude", "ocean_proximity", "median_house_value"], axis=1),
                                            df.ocean_proximity, test_size=.5)
x_val, x_test, y_val, y_test = train_test_split(x_, y_, test_size=.5)
dummy = DummyClassifier(strategy="prior")  # baseline that always predicts the most frequent class
dummy.fit(x_train, y_train)
DummyClassifier()
dummy.score(x_val, y_val)
0.4355912294440094
model = RandomForestClassifier().fit(x_train, y_train)
model.score(x_val, y_val)
0.6004306969459671
cross_val_score(model, x_test, y_test)
array([0.57632094, 0.58414873, 0.60665362, 0.57827789, 0.5798237 ])
cross_val_score(dummy, x_test, y_test)
array([0.44716243, 0.44618395, 0.44618395, 0.44618395, 0.44662096])
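Plain accuracy can look deceptively non-trivial for the prior-strategy dummy, since it simply predicts the most frequent ocean_proximity class. Scoring with balanced accuracy instead, for instance, makes the gap between the baseline and the random forest easier to read:
# balanced accuracy averages recall over classes, so a constant prediction scores poorly
cross_val_score(dummy, x_test, y_test, scoring="balanced_accuracy")
cross_val_score(model, x_test, y_test, scoring="balanced_accuracy")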
Exercise#
Try different dummy strategies and see how they compare.
dummy = DummyClassifier(strategy=...)